# Pckgs -------------------------------------
library(fs) # Cross-Platform File System Operations Based on 'libuv'
library(tidyverse) # Easily Install and Load the 'Tidyverse'
library(janitor) # Simple Tools for Examining and Cleaning Dirty Data
library(skimr) # Compact and Flexible Summaries of Data
library(here) # A Simpler Way to Find Your Files
library(paint) # paint data.frames summaries in colour
library(readxl) # Read Excel Files
library(tidytext) # Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools
library(SnowballC) # Snowball Stemmers Based on the C 'libstemmer' UTF-8 Library
library(rsample) # General Resampling Infrastructure

WB Project Data Preprocessing
Work in progress
Set up
—————————————————————————-
Data sources
WB Projects & Operations
World Bank Projects & Operations can be explored at:
- Data Catalog, from which the dataset can be downloaded
Accessibility Classification: public, under Creative Commons Attribution 4.0
Example: https://datacatalog.worldbank.org/search/dataset/0037800/World-Bank-Projects---Operations
Raw data
—————————————————————————
Attempt # 1 {Ingest Projects data (via API)}
DOESN’T WORK!
Attempt # 2.a {Ingest Projects data (manually split)}
I manually retrieved ALL WB projects approved between FY 1973 and 2023 (last FY incomplete) on 09/22/2022 (WDRs go from 1978 to 2022) using this example url, and saved individual .xlsx files in data/raw_data/project
- note: the manual download is limited to 500 records per file
— Load all .xlsx files separately
— Save objs in folder as .Rds files separately
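A minimal sketch of those two steps, assuming the .xlsx files sit in data/raw_data/project (the helper names `rds_name` and `xlsx_to_rds` are mine, not part of the original workflow):

```r
library(fs)      # dir_ls()
library(readxl)  # read_excel()
library(purrr)   # walk()

# Map an .xlsx path to the corresponding .Rds path
rds_name <- function(path) sub("\\.xlsx$", ".Rds", path)

# Read one .xlsx file and save it next to itself as .Rds
xlsx_to_rds <- function(path) {
  dat <- readxl::read_excel(path)
  saveRDS(dat, rds_name(path))
}

proj_dir <- here::here("data", "raw_data", "project")
if (fs::dir_exists(proj_dir)) {
  purrr::walk(fs::dir_ls(proj_dir, glob = "*.xlsx"), xlsx_to_rds)
}
```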
Attempt # 2.b Ingest Projects data (manually all together)!
I manually retrieved ALL WB projects approved between FY 1947 and 2026, as of 31/08/2024, simply using the Excel button on the WBG Projects page, then saved the huge .xls file in data/raw_data/project2/all_projects_as_of29ago2024.xls (plus an .Rdata copy of the original file)
all_projects_as_of29ago2024 <- read_excel(here::here ("data", "raw_data", "project2","all_projects_as_of29ago2024.xls"),
col_names = FALSE,
skip = 1)
# Column names
cnames <- read_excel(here::here("data", "raw_data", "project2", "all_projects_as_of29ago2024.xls"),
col_names = FALSE,
skip = 1,
n_max = 2)
# Full file
all_proj <- read_excel(here::here("data", "raw_data", "project2", "all_projects_as_of29ago2024.xls"),
col_names = TRUE,
skip = 2)
save(all_proj, file = here::here("data", "raw_data", "project2", "all_projects_as_of29ago2024.Rdata") )
rm(all_projects_as_of29ago2024)

Explore Project mega file
_______
CLEANING FILE PDOs
_______
Clean all_proj
This data set has a lot of blank values, probably also because some information was not collected back in 1947… (e.g. PDO)
# Date formats are messy in different ways across 2 columns:
# 1947-12-31 12:00:00 # closingdate
# 8/3/1948 12:00:00 AM # closingdate
#
# 1955-03-15T00:00:00Z # boardapprovaldate
# Mutate the date columns to parse the dates, handling different formats and blanks
all_proj_t <- all_proj %>%
# 1) Parsed using parse_date_time() with mdy HMS and mdy HMSp to handle "MM/DD/YYYY HH:MM AM/PM" formats.
mutate(across("closingdate", ~ if_else(
. == "",
NA_POSIXct_, # Return NA for blank entries
parse_date_time(., orders = c("mdy HMS", "mdy HMSp"))
)),
# 2) Parsed using ymd_hms() because it follows the ISO 8601 format (e.g., "1952-04-29T00:00:00Z").
across("boardapprovaldate", ~ if_else(
. == "",
NA_POSIXct_, # Return NA for blank entries
ymd_hms(., tz = "UTC") # Handle ISO 8601 format (e.g., "1952-04-29T00:00:00Z")
))) %>%
mutate(boardapproval_year = year(boardapprovaldate),
boardapproval_month = month(boardapprovaldate)) %>%
mutate(boardapprovalFY = case_when(
boardapproval_month >= 1 & boardapproval_month < 7 ~ boardapproval_year,
boardapproval_month >= 7 & boardapproval_month <= 12 ~ boardapproval_year +1)) %>%
relocate(boardapprovalFY, .after = boardapprovaldate ) %>%
mutate(closingdate_year = year(closingdate),
closingdate_month = month(closingdate)) %>%
mutate(closingdateFY = case_when(
closingdate_month >= 1 & closingdate_month < 7 ~ closingdate_year,
closingdate_month >= 7 & closingdate_month <= 12 ~ closingdate_year +1)) %>%
relocate(closingdateFY, .after = closingdate )
tabyl(all_proj$closingdate)
tabyl(all_proj_t$closingdateFY)
tabyl(all_proj$boardapprovaldate)
tabyl(all_proj_t$boardapprovalFY)

Explore which projects have no PDO
# Function to count missing values in a subset of columns
count_missing_values <- function(data, columns) {
# Select the subset of columns
df_subset <- data %>% select(all_of(columns))
# Use skimr to skim the data
skimmed <- skim(df_subset)
# Extract the relevant columns for column names and missing values
missing_table <- skimmed %>%
select(skim_variable, n_missing)
# Return the table
return(missing_table)
}
# Use the function on a subset of columns
count_missing_values(all_proj_t, c("pdo", "projectstatusdisplay", "boardapprovalFY", "sector1", "theme1"))
missing_pdo <- all_proj_t %>%
#select(id, pdo, countryname, projectstatusdisplay, lendinginstr, boardapprovalFY, projectfinancialtype) %>%
filter(is.na(pdo))
# Now I compare to get a sense of distribution in all_proj_t v. missing_pdo...
tabyl(all_proj_t$projectstatusdisplay) %>% adorn_pct_formatting()
tabyl(missing_pdo$projectstatusdisplay) %>% adorn_pct_formatting()
tabyl(all_proj_t$regionname) %>% adorn_pct_formatting()
tabyl(missing_pdo$regionname) %>% adorn_pct_formatting()
tabyl(all_proj_t$boardapprovalFY) %>% adorn_pct_formatting()
tabyl(missing_pdo$boardapprovalFY) %>% adorn_pct_formatting()
tabyl(all_proj_t$projectfinancialtype) %>% adorn_pct_formatting()
tabyl(missing_pdo$projectfinancialtype) %>% adorn_pct_formatting()
tabyl(all_proj_t$sector1) %>% adorn_pct_formatting()
tabyl(missing_pdo$sector1) %>% adorn_pct_formatting()
tabyl(all_proj_t$theme1) %>% adorn_pct_formatting()
tabyl(missing_pdo$theme1) %>% adorn_pct_formatting() # most NA
#Environmental Assessment Category
tabyl(all_proj_t$envassesmentcategorycode) %>% adorn_pct_formatting() # most NA
tabyl(missing_pdo$envassesmentcategorycode) %>% adorn_pct_formatting()
# Environmental and Social Risk
tabyl(all_proj_t$esrc_ovrl_risk_rate) %>% adorn_pct_formatting() # most NA
tabyl(missing_pdo$esrc_ovrl_risk_rate) %>% adorn_pct_formatting()
tabyl(all_proj_t$lendinginstr) %>% adorn_pct_formatting() # Specific Investment Loan 4928 43.9%
tabyl(missing_pdo$lendinginstr) %>% adorn_pct_formatting() # Specific Investment Loan 4928 43.9%

Based on some "critical" categories, I would say that even though many projects are missing a PDO, the incidence seems to happen at random, except maybe for lendinginstr: Specific Investment Loans are missing the PDO in 4,928 projects (43.9%). Why?
_______
PREPROCESSING
_______
Obtain Reduced df projs
For my purposes it is safe to drop all the projects with a missing PDO!
- it turns out there are no Development objectives spelled out until FY2001
projs <- all_proj_t %>%
filter(!is.na(pdo)) %>%
filter(!is.na(projectstatusdisplay)) %>%
  filter(boardapprovalFY < 2024 & boardapprovalFY > 1972) %>%
  select(id, pr_name = project_name, pdo, boardapprovalFY, closingdateFY, status = projectstatusdisplay, regionname, countryname, sector1, theme1,
         lendinginstr, env_cat = envassesmentcategorycode, ESrisk = esrc_ovrl_risk_rate, curr_total_commitment)
tabyl(projs$boardapprovalFY) %>% adorn_pct_formatting()
nrow(projs) # 8836
paint(cnames)
paint(projs)
rm("all_proj", "all_proj_t", "cnames", "count_missing_values", "missing_pdo")

Manual correction of text [projs]
projs$pdo[projs$id == "P164414"] <- "The Multisector Development Policy Financing (DPF) intends to support Ukraine's highest priority reforms to move from economic stabilization to stronger and sustained economic growth by addressing deeper structural bottlenecks and governance challenges in key areas. Possible policy areas include : (i) strengthening private sector competitiveness, including reforming land markets and the financial sector; (ii) promoting sustainable and effective public services, including reforming pensions, social assistance, and health; and (iii ) improving governance, including reforming anticorruption institutions and tax administration. The financing DPL or Policy Based Guarantee (PBG)."
projs$pdo[projs$id == "P111432"] <- "Project development objectives for RCIP 3 include the following: Malawi: Support the Recipient's efforts to improve the quality, availability and affordability of broadband within its territory for both public and private users. Mozambique: Support the Recipient's efforts to contribute to lower prices for international capacity and extend the geographic reach of broadband networks and to contribute to improved efficiency and transparency through eGovernment applications. Tanzania: Support the Recipient's efforts to: (i) lower prices for international capacity and extend the geographic reach of broadband networks; and (ii) improve the Government's efficiency and transparency through eGovernment applications."
projs$pdo[projs$id == "P252350"] <- "The Program Development Objective is to expand opportunities for the acquisition of quality, market-relevant skills in selected economic sectors. The selected economic sectors include Energy, Transport and Logistics, and Manufacturing (with a focus on ‘Made-Rwanda’ products such as construction materials, light manufacturing and agro-processing). Building skills to advance the country’s economic agenda is a key priority of the GoR’s ongoing Economic Development and Poverty Reduction Strategy-2 (EDPRS2) launched in 2013. EDPRS2 builds on the country’s Vision 2020 which seeks to transform the country by raising its per capita GDP to middle-income level by 2020. The Program is grounded in the Government of Rwanda’s (GoR) National Employment Programs (NEP) approved by Cabinet in 2014. NEP was designed to address the employment challenges in Rwanda and equip its population with the skills required to support economic development. The main results areas of the operation are: (i) reinforcing governance of the skills development system; (ii) ensuring provision of quality training programs with market relevance; (iii) expanding opportunities for continuous upgrading of job-relevant skills for sustained employability; and (iv) capacity building for implementation. The Program will disburse against achievement of specific Disbursement Linked Results (DLRs) in these results areas"

_______
SPLITTING SAMPLE
_______
(also to work on something smaller)
# ensure we always get the same result when sampling (for convenience )
set.seed(12345)
# use `regionname` as strata
tabyl(projs$regionname)
projs_split <- projs %>%
  # 50% training, 25% validation (the remaining 25% is the test set)
  rsample::initial_validation_split(prop = c(0.50, 0.25),
                                    # ensure the sets are balanced across regions
                                    strata = regionname)
# resulting 3 datasets
projs_train <- rsample::training(projs_split)
projs_val <- rsample::validation(projs_split)
projs_test <- rsample::testing(projs_split)
tabyl(projs_train$regionname)
tabyl(projs_val$regionname)
tabyl(projs_test$regionname)

_______
TEXT ANALYSIS
_______
i) Tokenization
Where a "word" is somewhat abstract, a "type" is a concrete term used in actual language, and a "token" is the particular instance we're interested in (e.g. abstract things ('wizards') vs. individual instances of the thing ('Harry Potter')). Breaking a piece of text into words is thus called "tokenization", and it can be done in many ways.
— The choices of tokenization
- Should words be lowercased? x
- Should punctuation be removed? x
- Should numbers be replaced by some placeholder?
- Should words be stemmed (or lemmatized)? x
- Should bigrams/multi-word phrases be used instead of single words?
- Should stopwords (the most common words) be removed? x
- Should rare words be removed?
— Tokenization 1 PDO | regular expression
The R function strsplit lets us do just this: split a string into pieces. Note, for example, that splitting on non-letters turns the word "Don't" into two words.
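A quick base-R illustration (the sentence is invented): split on runs of anything that is not a letter.

```r
txt <- "Don't just split the PDO text naively."

# Tokenize by splitting on anything that is not a letter
tokens <- unlist(strsplit(tolower(txt), "[^a-z]+"))
tokens
#> "don" "t" "just" "split" "the" "pdo" "text" "naively"
```

Note how "Don't" becomes the two tokens "don" and "t".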
— Tokenization 1 PDO | tidytext (ILLUSTRATION)
The simplest way is to remove anything that isn't a letter. The workhorse function in tidytext is unnest_tokens. It creates a new column (here called word) holding the individual tokens from the text column.
pdo_1 <- as_tibble(projs_train$pdo[1] )
# LIST OF features I can add to `unnest_tokens`
tok_feat_l <- list(
  # 1) all to lowercase
pdo_1 %>% unnest_tokens(word, value) %>%
select(lowercase = word),
# 4) `SnowballC::wordStem` extracts stems of each given words in the vector.
pdo_1 %>% unnest_tokens(word, value) %>% rowwise() %>% mutate(word = SnowballC::wordStem(word)) %>%
select(stemmed = word),
# 1.b) keep uppercase if there are
pdo_1 %>% unnest_tokens(word, value, to_lower = F) %>%
select(uppercase = word),
# 2) keep punctuation {default is rid}
pdo_1 %>% unnest_tokens(word, value, to_lower = F, strip_punc = FALSE) %>%
select(punctuations = word),
# 5) bigram
pdo_1 %>% unnest_tokens(word, value, token = "ngrams", n = 2, to_lower = F) %>%
select(bigrams = word)
)
# Return a data frame created by column-binding.
tok_feat_df <- map_dfc(tok_feat_l , ~ .x %>% head(50))
tok_feat_df
# # my choice
# pdo_1_t_mod <- pdo_1 %>%
# # no punctuation, yes capitalized
# unnest_tokens(word, value, to_lower = F, strip_punc = TRUE) %>% # 249 obs
# # exclude stopwords
# anti_join(stop_words) # 109 obs
#
# head(pdo_1_t_mod, 15)

— Tokenize train PDOs | tidytext
# pdo_train_token <- projs_train %>% # 4416
# ungroup() %>%
# # Drop some useless columns PDOs
# dplyr::select(id, boardapprovalFY, pr_name, regionname, pdo) %>%
# tidytext::unnest_tokens(output = word,
# token = "words",
# input = pdo ,
# to_lower = T, # otherwise cannot match the stop_words
# strip_punc = TRUE,
# drop = F # keep original text col (input)
# ) %>% # 221,456
#   relocate (pdo, .before = "word") # 220,777

But I want to preserve hyphenated words (so that e.g. "community-based" etc. remain intact):
pdo_train_token <- projs_train %>% # 4416
ungroup() %>%
# Drop some useless columns PDOs
dplyr::select(id, boardapprovalFY, pr_name, regionname, pdo) %>%
# Step 1: Replace hyphens with a placeholder (e.g., "HYPHEN")
mutate(pdo_modified = str_replace_all(pdo, "-", "HYPHEN")) %>%
# Step 2: Unnest tokens, with punctuation stripping (but hyphens are now preserved)
tidytext::unnest_tokens(output = word,
token = "words",
input = pdo_modified ,
to_lower = T, # otherwise cannot match the stop_words
strip_punc = TRUE,
drop = F # keep original text col (input)
) %>% # 218,193
select(-pdo_modified) %>%
relocate (pdo, .before = "word") %>%
# Step 3: Replace the placeholder back with a hyphen
mutate(word = str_replace_all(word, "hyphen", "-")) %>%
mutate(word = str_replace_all(word, "covid-19", "covid19")) # 218,193
# Now `pdo_train_token` has hyphenated words preserved

— Default tidytext package stop_words
# default stopwords that come with the tidytext package t
sw <- tidytext::stop_words
paint(stop_words)

— My own custom_stop_words
Remove stop words, which are the most common words in a language.
# Custom list of articles, prepositions, and pronouns
custom_stop_words <- c(
# Articles
"the", "a", "an",
  # Conjunctions (and a few more)
  "and", "but", "or", "yet", "so", "for", "nor", "as", "at", "by", "per",
# Prepositions
"of", "in", "on", "at", "by", "with", "about", "against", "between", "into", "through",
"during", "before", "after", "above", "below", "to", "from", "up", "down", "under",
"over", "again", "further", "then", "once",
# Pronouns
"i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your",
"yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her",
"hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves" ,
"this", "that", "these", "those", "which", "who", "whom", "whose", "what", "where",
"when", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other",
# "some", "such", "no", "not",
# "too", "very",
# verbs
"is", "are", "would", "could", "will", "be"
)
# Convert to a data frame if needed for consistency with tidytext
custom_stop_words_df <- tibble(word = custom_stop_words)

— Remove stop_words from train PDOs
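Removing them from the tokenized PDOs is then a single anti_join; a self-contained toy version (stop list and tokens invented here):

```r
library(dplyr)
library(tibble)

stop_df <- tibble(word = c("the", "of", "to", "and"))
toy_tokens <- tibble(word = c("improve", "the", "quality", "of", "water"))

# Keep only the tokens that do NOT appear in the stop-word list
toy_kept <- anti_join(toy_tokens, stop_df, by = "word")
toy_kept$word
#> "improve" "quality" "water"
```

The same call with `pdo_train_token` and `custom_stop_words_df` drops the custom stop words from the training tokens.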
— Other unwanted tokens
You may want to use your own curated list
- no numbers (not needed in this context)
- no `text's` (turned into `text`)
- no units of measurement
First, I check which they are…
# further restriction on the words
pdo_train_tok <- pdo_train_token %>% # 142,195
mutate(word_original = word) %>%
relocate(word_original, .after = pdo) %>%
#### ------ NO `numbers`
# The regex "\\d" detects any digit (0-9)
# The regex "^\\d*\\.?\\d+$" any string that consists only of digits (with an optional decimal point)
mutate (word_num = str_detect(word_original, "^\\d*\\.?\\d+$")) %>%
# The regex "^\\d*\\.?\\d+$" match numbers with no letters in the cell, allowing for both decimal points and thousands separators
mutate (word_num2 = str_detect(word_original, "^\\d{1,3}(,\\d{3})*(\\.\\d+)?$")) %>%
#### ------ NO punctuation signs (except for hyphens)
# The regex ""[[:punct:]]"" match numbers that any string that contains at least one punctuation symbol or sign.
mutate (word_punct2 = str_detect(word_original, "[[:punct:]]") & !str_detect(word_original, "^[[:alpha:]]+-[[:alpha:]]+$")) %>%
#### ------ NO hyphen with nothing else (redundant for above )
mutate (word_hyp = str_detect(word_original, "^-$")) %>%
#### ------ NO units
mutate(word_units = str_detect(word_original, "\\b(usd|mw|gw|kwh|1,2,3)\\b")) %>%
#### ------ `text's` (TURNED TO `text` )
#### 1/2 ....... contains `'s`
mutate(word_s = str_detect(word_original, "\\b\\w+'s\\b")) %>%
#### 2/2 ....... looks for any word ending with 's and REPLACES it with JUST the word before the apostrophe
  mutate(word = str_replace_all(word_original, "\\b(\\w+)'s\\b", "\\1"))

… then I get rid of the unwanted tokens
pdo_train_t <- pdo_train_tok %>% # 138,210
# get rid of numbers and other non meaningful words....
filter (word_num == FALSE) %>% # ... > 139,406
filter (word_num2 == FALSE) %>% # ... > 139,300
filter (word_punct2 == FALSE) %>% # ... > 137,693
# filter (word_hyp == FALSE) %>% # ... > 139,144 (redudant with above)
filter (word_units == FALSE) %>% # ... > 135,129
# DROP temporary cols
select (-word_num, -word_num2, -word_punct2, -word_hyp, -word_units, -word_s)
# Count words
count_train <- pdo_train_t %>%
  count(word, sort = TRUE) # 11,201 --> 10,345

ii) Word stemming
to reduce words to their stem or root form
- this sucks!
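The stem-frequency plots below rely on a stem column; assuming the token data frame from above, it can be added with SnowballC::wordStem (toy tokens here for illustration):

```r
library(dplyr)
library(SnowballC)

toy <- tibble::tibble(word = c("supporting", "services", "energies", "water"))

# wordStem() is vectorized, so no rowwise() is needed
toy_stemmed <- toy %>% mutate(stem = wordStem(word, language = "english"))
toy_stemmed$stem
#> "support" "servic" "energi" "water"
```

The same `mutate(stem = wordStem(word, language = "english"))` on `pdo_train_t` produces the stem column used in the plots.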
_______
TEXT ANALYSIS/SUMMARY
_______
_______
>>>>>> HERE <<<<<<<<<<<<<<<<<<
Review what I had done for cleaning in analysis/03_WDR_pdotracs_explor.qmd
https://cengel.github.io/R-text-analysis/textprep.html#detecting-patterns
https://guides.library.upenn.edu/penntdm/r
https://smltar.com/stemming#how-to-stem-text-in-r (book chapter on stemming)
START FROM ## III.i) Tokenization
_______
see https://cengel.github.io/R-text-analysis/textanalysis.html
Frequencies of documents/words/stems
Word freq ggplot
pdo_train_t %>%
filter (!(word %in% c("pdo","project", "development", "objective", "i","ii", "iii"))) %>%
count(word) %>%
filter(n > 500) %>%
mutate(word = reorder(word, n)) %>% # reorder values by frequency
ggplot(aes(word, n)) +
geom_col(fill = "gray") +
  coord_flip() # flip x and y coordinates so we can read the words better

Stem freq ggplot
pdo_train_t %>%
filter (!(stem %in% c("pdo","project", "development", "objective", "i","ii", "iii"))) %>%
count(stem) %>%
filter(n > 500) %>%
mutate(stem = reorder(stem, n)) %>% # reorder values by frequency
ggplot(aes(stem, n)) +
geom_col(fill = "gray") +
  coord_flip() # flip x and y coordinates so we can read the words better

We can pipe this into ggplot to make a graph of the words that occur more than 500 times. We count the words and use geom_col to represent the n values.
Isolate sector words and see frequency over years
df <- pdo_train_t %>%
filter (stem %in% c("water", "transport", "urban", "energi", "health")) %>%
mutate (FY = boardapprovalFY) %>%
# group_by(FY) %>%
#summarize (n_rep = length(stem)) %>%
count(FY, stem)
#df$FY
ggplot(data = df, aes(x = FY, y = n, group = stem, color = stem)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(2001, 2023, by= 2)) +
scale_color_viridis_d(option = "magma", end = 0.9) +
  facet_wrap(~stem, ncol = 2, scales = "free") + guides(color = "none") +
theme_bw()+
theme(# Adjust angle and alignment of x labels
axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Sector words frequency in PDO over Fiscal Years",x = "Board approval FY", y = "Counts of 'sector' word (stem)") +
geom_vline(data = subset(df, stem == "health"), aes(xintercept = 2020),
linetype = "dashed", color = "#9b6723") +
geom_text(data = subset(df, stem == "health"), aes(x = 2020, y = max(df$n)*0.85, label = "Covid"),
            angle = 90, vjust = -0.5, color = "#9b6723")

Term frequency
Word and document frequency: Tf-idf
The goal is to quantify what a document is about.
- term frequency (tf) = how frequently a word occurs in a document… but some words occur many times without being important
- term’s inverse document frequency (idf) = decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents.
- statistic tf-idf (= tf-idf) = an alternative to using stopwords is the frequency of a term adjusted for how rarely it is used. [It measures how important a word is to a document in a collection (or corpus) of documents, but it is still a rule-of-thumb or heuristic quantity]
The tf-idf is the product of the term frequency and the inverse document frequency:
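In symbols, with N documents in the corpus and n_t of them containing term t: tf-idf(t, d) = tf(t, d) × ln(N / n_t). tidytext computes this with bind_tf_idf(); a toy sketch (the counts are invented):

```r
library(dplyr)
library(tidytext)

toy_counts <- tibble::tribble(
  ~doc, ~word,     ~n,
  "A",  "project",  3,
  "A",  "water",    2,
  "B",  "project",  2,
  "B",  "health",   4
)

toy_tfidf <- toy_counts %>% bind_tf_idf(word, doc, n)
# "project" occurs in every document, so idf = ln(2/2) = 0 and its tf-idf is 0
```

Words that appear everywhere (like "project" in the PDOs) get zero weight, which is exactly why tf-idf can substitute for a stop-word list.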
N-Grams
…
Co-occurrence
…
_______
TOPIC MODELING
_______
Topic modeling is an unsupervised machine learning technique that exploratively identifies latent topics based on frequently co-occurring words.
It can identify topics or themes that occur in a collection of documents, allowing hidden patterns and relationships within text data to be discovered. It is widely applied in fields such as social sciences and humanities.
https://bookdown.org/valerie_hase/TextasData_HS2021/tutorial-13-topic-modeling.html
https://m-clark.github.io/text-analysis-with-R/topic-modeling.html
https://sicss.io/2020/materials/day3-text-analysis/topic-modeling/rmarkdown/Topic_Modeling.html
Document-Term Matrix
…
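Most topic-model implementations want a document-term matrix rather than a tidy table; tidytext::cast_dtm() does the conversion (toy counts here; the tm package is used under the hood):

```r
library(dplyr)
library(tidytext)

toy_counts <- tibble::tribble(
  ~doc, ~word,     ~n,
  "A",  "water",    2,
  "A",  "project",  1,
  "B",  "health",   3
)

toy_dtm <- cast_dtm(toy_counts, document = doc, term = word, value = n)
dim(toy_dtm)  # 2 documents x 3 terms
```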
Latent Dirichlet Allocation (LDA)
… EXPLAINED here: https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2
How do I include independent variables in my topic model?
https://bookdown.org/valerie_hase/TextasData_HS2021/tutorial-13-topic-modeling.html#how-do-i-include-independent-variables-in-my-topic-model
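A minimal LDA fit on such a matrix, via the topicmodels package (toy corpus; k = 2 and the seed are arbitrary choices):

```r
library(dplyr)
library(tidytext)
library(topicmodels)

toy_counts <- tibble::tribble(
  ~doc, ~word,        ~n,
  "A",  "water",       5,
  "A",  "irrigation",  3,
  "B",  "health",      4,
  "B",  "vaccine",     2
)
toy_dtm <- cast_dtm(toy_counts, doc, word, n)

lda_fit <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

# Per-topic word probabilities (the "beta" matrix), back in tidy form
topics <- tidy(lda_fit, matrix = "beta")
```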
_______
STRUCTURAL TOPIC MODELING (STM)
_______
The Structural Topic Model is a general framework for topic modeling with document-level covariate information. The covariates can improve inference and qualitative interpretability and are allowed to affect topical prevalence, topical content or both.
MAIN REFERENCE stm R package http://www.structuraltopicmodel.com/ EXAMPLE UN corpus https://content-analysis-with-r.com/6-topic_models.html STM 1/2 https://jovantrajceski.medium.com/structural-topic-modeling-with-r-part-i-2da2b353d362 STM 2/2 https://jovantrajceski.medium.com/structural-topic-modeling-with-r-part-ii-462e6e07328
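A minimal stm pipeline with a document-level covariate (documents and metadata invented here; with the real data the covariate could be boardapprovalFY or regionname). The stm() call itself is left commented as a sketch:

```r
library(stm)

docs <- c("improve water supply and irrigation services",
          "strengthen health systems and vaccine delivery",
          "expand access to water and sanitation services",
          "support health financing and hospital services")
meta <- data.frame(FY = c(2005, 2010, 2015, 2020))

# Built-in preprocessing: lowercasing, stop-word removal, stemming
processed <- textProcessor(docs, metadata = meta, verbose = FALSE)
prepped   <- prepDocuments(processed$documents, processed$vocab,
                           processed$meta, verbose = FALSE)

# Topical prevalence modeled as a function of the covariate:
# fit <- stm(prepped$documents, prepped$vocab, K = 2,
#            prevalence = ~ FY, data = prepped$meta, verbose = FALSE)
```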
BERTopic
Developed by Maarten Grootendorst, BERTopic enhances the process of discovering topics by using document embeddings and a class-based variation of Term Frequency-Inverse Document Frequency (TF-IDF).
https://medium.com/@supunicgn/a-beginners-guide-to-bertopic-5c8d3af281e8
_______
(Dynamic) TOPIC MODELING OVER TIME
_______
Example: An analysis of Peter Pan using the R package koRpus https://ladal.edu.au/topicmodels.html#Topic_proportions_over_time